The literature on criterion-referenced testing is full of
discussions concerning whether "classical" measurement techniques
are appropriate, whether variance is necessary, whether new indices
of reliability are needed, and the like (see, for example, Woodson,
1974a, b; Millman and Popham, 1974; Haladyna, 1974). What appears
to be lacking, however, is a clear and simple discussion of why the
problems occur. This paper suggests that many of the results
obtained when "classical" techniques are applied to
criterion-referenced tests, particularly in the context of mastery
learning, are perfectly reasonable, interpretable, and should be
expected.

Consider, for example, Nunnally's (Nunnally, 1967) discussion
of the domain-sampling model. The model assumes that any particular
measure is composed of "a random sample of items from a hypothetical
domain of items (p. 173)." The definition of "true score" is the
score that would be obtained if all items in the domain were included
in the measure. The only other assumption required for the development
of the model is that all items contain an equal amount of the "common
core," the skill or attribute that is measured. Statistically,
this implies that the average correlation of each item with all the
others is the same for all items. Notice that this does not imply
that all the inter-item correlations are the same. This description
seems to fit very nicely with what is required for
criterion-referenced tests. In fact, it is almost the same as the
situation Millman (Millman, 1973) described in his review article on
domain-referenced measures.

The views expressed in this paper are those of the authors and do
not imply endorsement by the U.S. Army.

It is possible to derive Cronbach's (1951) coefficient alpha
from the domain-sampling model without making any further assumptions
about the nature of the domain. As Cureton (1958) pointed out, the
only required assumption is that for any given k-item test "there is
at least one possible division of the k-items into the two half-tests
... such that these two half-tests are equally reliable and
equally variable (p. 725)." Since, in general, the particular
partition of items that meets this requirement is not known, it is
also necessary to assume "that the mean within-half-test item
covariances are not only equal to each other but are equal also to the
mean between-half-test item covariances (p. 726)." However, these
assumptions are merely a restatement, in terms of the covariances, of
the basic assumption of the domain-sampling model that the average
correlation of each item with all the others is the same for all
items. If this is the case, then the question is: Why doesn't the
model work with criterion-referenced tests? We maintain that the
model does work. The problems occur in interpreting the results.
If one considers the results of applying classical techniques in
terms of the nature of the items and the people being tested, they
are exactly what would be expected.
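The link between coefficient alpha and the item covariances can be made concrete with a small sketch. It computes alpha two ways: from the item and total-score variances, and from the mean inter-item covariance. The response matrix and the function names are hypothetical and are not taken from either study; population (divide by N) variances are assumed throughout.

```python
from itertools import combinations

def pvariance(xs):
    """Population variance (divide by N)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def pcovariance(xs, ys):
    """Population covariance (divide by N)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def alpha_from_variances(items):
    """alpha = k/(k-1) * (1 - sum of item variances / total-score variance)."""
    k = len(items)
    totals = [sum(person) for person in zip(*items)]
    item_var = sum(pvariance(i) for i in items)
    return k / (k - 1) * (1 - item_var / pvariance(totals))

def alpha_from_mean_covariance(items):
    """alpha = k * cbar / (vbar + (k-1) * cbar), cbar = mean inter-item covariance."""
    k = len(items)
    vbar = sum(pvariance(i) for i in items) / k
    pairs = list(combinations(items, 2))
    cbar = sum(pcovariance(a, b) for a, b in pairs) / len(pairs)
    return k * cbar / (vbar + (k - 1) * cbar)

# Hypothetical hit/miss data: 4 items (rows) by 6 examinees (columns).
items = [
    [1, 1, 0, 1, 0, 0],
    [1, 0, 0, 1, 1, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0],
]
a1 = alpha_from_variances(items)
a2 = alpha_from_mean_covariance(items)
print(round(a1, 4), round(a2, 4))
```

The two forms agree because the total-score variance is the sum of the item variances plus twice the sum of the inter-item covariances. When the mean covariance is near zero, alpha is near zero regardless of how well the items sample the domain.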

To illustrate the need for careful consideration of the data
source before statistical results are interpreted, two testing
situations are described. These examples come from a purely military
context (in fact, they involve tank gunnery skills), but the tests
clearly fit the requirements of criterion-referenced tests.
Therefore, the tests themselves and the results of the analyses
reported here should be thought of in the general context of
criterion-referenced testing and not in the restricted context of
military testing problems.

The data come from two separate studies, each designed to
investigate different aspects of the use of simulation devices to
train tank gunners. The first study (Powers, et al., 1975) investigated
the contribution of live firing, as a component of the training program,
to gunnery proficiency. Trainees were divided into four training
groups and were allowed twenty-four training "rounds" before taking
the criterion test. The proportions of live fire and laser simulated
fire included during the training differentiated the training groups.
Although the four groups received different types of training, the
most critical aspect of the training, the number of practice rounds,
was constant for each group. Therefore, it would be reasonable to
expect that the relative performances of the training groups would be
similar. In fact, the study failed to show differences in performance
on the criterion test as a function of the training program.

The second study (Rose, et al., in press) was designed to compare
the relative effectiveness of two tank gunnery training devices.
Seven training groups were included in the study. Three groups were
assigned to each training device and trained to proficiency levels on
the device of 30%, 50%, and 70% hits. The seventh group received no
training. Since the groups received different amounts of training
and achieved different proficiency levels in training, it would seem
reasonable to expect different levels of performance on the criterion
test. In fact, the study did reveal that some of the variance in
performance on the criterion test could be attributed to differential
levels of proficiency in training.

The criterion test in each study required gunnery trainees to
demonstrate their ability to hit moving targets with the main gun of
the M60A1 tank. In the first study, each gunner fired eight rounds,
each of which was scored hit or miss. In the second study, the test
consisted of twelve rounds. In the first study, the range to the
target was 1400 meters during its movement from left to right, and
1200 meters during its movement from right to left. The corresponding
ranges for the second study were 750 meters and 700 meters. With the
above exceptions of range and direction of travel, each test trial
was identical. Both tests seem to fit Nunnally's domain-sampling
model and Millman's requirement that all items come from the same
domain. Yet the analyses of the test scores yield vastly different
and seemingly contradictory results.

Summary statistics for the criterion tests are provided in
Tables 1 and 2 for the first and second study, respectively. Figure 1
compares the frequency distributions. The values for KR-20 and the
average interitem correlation are particularly perplexing. The first
study's results indicate a KR-20 of .0048 and an average interitem
correlation of .0027. Further, of the 28 intercorrelations obtained
from the item intercorrelation matrix, only 3 are significantly
different from zero at the .05 level of significance. The second
study yielded very different results. The value of KR-20 was .7136,
and the average interitem correlation was .1716. The correlation
matrix for the second study showed 41 of the 66 possible
intercorrelations significantly different from zero at the .05 level.
Why should two such apparently similar tests show such different
results? Perhaps more importantly, how does a test constructor
interpret these results and how can he use them to design better
tests?
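As a rough consistency check on these figures, the sketch below counts the item pairs for each test and applies the Spearman-Brown form of alpha to the reported average interitem correlations. This standardized form is a stand-in chosen for illustration, not the KR-20 computation used in the studies. It agrees with KR-20 only to the extent that the item variances are similar, so only approximate agreement should be expected.

```python
from math import comb

def standardized_alpha(k, mean_r):
    """Spearman-Brown form: reliability of k items with mean intercorrelation mean_r."""
    return k * mean_r / (1 + (k - 1) * mean_r)

# Values reported in the text for the two criterion tests.
studies = {
    "Study 1": {"k": 8, "mean_r": 0.0027, "kr20": 0.0048},
    "Study 2": {"k": 12, "mean_r": 0.1716, "kr20": 0.7136},
}

for name, s in studies.items():
    n_pairs = comb(s["k"], 2)  # number of distinct item intercorrelations
    est = standardized_alpha(s["k"], s["mean_r"])
    print(f"{name}: {n_pairs} item pairs, "
          f"standardized alpha {est:.4f}, reported KR-20 {s['kr20']:.4f}")
```

Under these assumptions the Study 2 value lands close to the reported KR-20, and the Study 1 value stays near zero. In both cases the reliability coefficient is driven almost entirely by the average interitem correlation.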
TEST SCORE    FREQUENCY    ITEM    DIFFICULTY (P)

    0             1          1         .345
    1            10          2         .418
    2            11          3         .236
    3            18          4         .345
    4             8          5         .400
    5             6          6         .418
    6             1          7         .273
    7             0          8         .364
    8             0
  Total          55

Variance: 1.797
KR-20: 0.0048
Average interitem correlation: 0.0027

Table 1. Summary statistics for the criterion test for Study 1
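The Table 1 summary statistics are enough to recompute KR-20 directly. The sketch below assumes the variance is a population variance (divided by N) and reads the frequency for a score of 2 as 11, the value that makes the frequencies total 55. The item difficulties are used as printed, rounded to three places, so the result can only be expected to match the reported value approximately.

```python
# Score frequency distribution from Table 1 (score -> number of gunners).
# ASSUMPTION: the count for a score of 2 is 11, which makes the total 55.
score_freq = {0: 1, 1: 10, 2: 11, 3: 18, 4: 8, 5: 6, 6: 1, 7: 0, 8: 0}

# Item difficulties (proportion of hits per round) from Table 1.
p = [0.345, 0.418, 0.236, 0.345, 0.400, 0.418, 0.273, 0.364]

n = sum(score_freq.values())
k = len(p)
mean = sum(x * f for x, f in score_freq.items()) / n

# ASSUMPTION: population variance (divide by N), not the N-1 sample variance.
variance = sum(f * (x - mean) ** 2 for x, f in score_freq.items()) / n

# KR-20 = k/(k-1) * (1 - sum(p*q) / total-score variance)
sum_pq = sum(pi * (1 - pi) for pi in p)
kr20 = k / (k - 1) * (1 - sum_pq / variance)

print(n, round(mean, 3), round(variance, 3), round(kr20, 4))
```

The sum of the item variances is almost exactly equal to the total-score variance. That is the situation KR-20 reports as near-zero reliability: the items contribute variance individually but share essentially no covariance.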



TEST SCORE    FREQUENCY    ITEM    DIFFICULTY (P)

    0        (illegible)     1         .429
    1        (illegible)     2         .487
    2            10          3         .526
    3            12          4         .474
    4            13          5         .500
    5            16          6         .558
    6            20          7         .545
    7            16          8         .662
    8            19          9         .578
    9            14         10         .636
   10            14         11         .604
   11            11         12         .630
   12             5

Variance: 8.402
KR-20: 0.7136
Average interitem correlation: 0.1716

Table 2. Summary statistics for the criterion test for Study 2
Measurement requires that there be sampling among people and
among items. Typically, the sampling problem with regard to people
is ignored during test development or item writing before validation.
One simply assumes that sufficient people are available and that there
is sufficient variability in their abilities to allow for item
characteristics to be studied. Essentially all of the test developer's
time is spent in sampling items and assuring himself that they are good
ones. However, in a criterion-referenced testing/mastery learning
context both sampling problems must be considered. What are the
implications for measurement theory if the examinee population has
very little variability in ability? Hambleton and Traub (1973) in their
article on latent trait models provide an answer to this question in the
discussion of "local independence": "... in an infinite subpopulation
of examinees, all of whom are at the same ability level, scores on one
test item will be statistically independent of scores on another (if
the assumption of local independence holds). It will be recognized
that the assumption of local independence does not imply that test
items are uncorrelated over the total group of examinees. Correlations
between items measuring the same ability will, in general, exist
whenever the examinees responding to the items differ on the underlying
ability measured by the test (pp. 195-196)." The data from the first
study illustrate the results of using a good, content valid
criterion-referenced test with a highly homogeneous examinee
population. The second study illustrates the results of using a
similar test with a more traditional, relatively heterogeneous
examinee population. The two sets of data behave in precisely the
manner described by Hambleton and Traub. In a mastery learning
context a homogeneous examinee population is expected or produced by
the training. Perhaps if this type of situation prevails, a
criterion-referenced test will produce low values of KR-20 and zero
interitem correlations. Certainly, such results should not be cause
for alarm.
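The effect of examinee homogeneity on KR-20 can be illustrated with a small simulation under local independence. This is a sketch under stated assumptions, not a reanalysis of either study: the sample size, the seed, and the Beta(2, 2) spread of abilities in the heterogeneous group are arbitrary choices made for illustration.

```python
import random

def kr20(responses):
    """KR-20 for a list of examinees, each a list of 0/1 item scores."""
    n, k = len(responses), len(responses[0])
    totals = [sum(r) for r in responses]
    mean = sum(totals) / n
    var = sum((t - mean) ** 2 for t in totals) / n
    p = [sum(r[i] for r in responses) / n for i in range(k)]
    return k / (k - 1) * (1 - sum(pi * (1 - pi) for pi in p) / var)

def simulate(abilities, k, rng):
    """Each examinee fires k locally independent rounds; P(hit) equals ability."""
    return [[1 if rng.random() < a else 0 for _ in range(k)] for a in abilities]

rng = random.Random(0)   # fixed seed so the run is repeatable
n, k = 5000, 8           # hypothetical sample size; 8 rounds as in Study 1

# Homogeneous group: every examinee has the same hit probability.
homogeneous = [0.35] * n
# Heterogeneous group: abilities spread over examinees (assumed Beta(2, 2)).
heterogeneous = [rng.betavariate(2.0, 2.0) for _ in range(n)]

kr_hom = kr20(simulate(homogeneous, k, rng))
kr_het = kr20(simulate(heterogeneous, k, rng))
print("homogeneous   KR-20:", round(kr_hom, 3))
print("heterogeneous KR-20:", round(kr_het, 3))
```

The items are generated in exactly the same way for both groups. Only the spread of ability differs, and it alone moves KR-20 from roughly zero to a substantial value.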

One final interesting feature of these data should be mentioned.
If the series of test trials is considered a series of independent
Bernoulli trials, then the average proportion of hits is an unbiased
estimate of the group ability. For the first study the average
proportion of hits was .35. For the second study the average
proportion of hits was .55. One can compare the theoretical
characteristics of a series of Bernoulli trials with p=.35 and p=.55
to the observed data. In the case of the first study the
Kolmogorov-Smirnov one-sample test (Siegel, 1956, p. 47) indicates
that the probability of obtaining the scores observed under the null
hypothesis that the score distribution is Bernoulli (that is,
binomial) with p=.35 is greater than .20. In other words, it seems
reasonable to explain the observed test results in terms of a highly
homogeneous group of individuals, each of whom has a .35 chance of
hitting the target.
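The comparison for the first study can be sketched as follows. The code computes the Kolmogorov-Smirnov statistic D between the observed cumulative score proportions and a binomial distribution with n=8 and p=.35, then compares D with the usual large-sample critical value for the .20 level, 1.07 divided by the square root of N. The frequency for a score of 2 is read as 11, an assumption consistent with the total of 55.

```python
from math import comb, sqrt

def binomial_cdf(n, p):
    """Cumulative probabilities P(X <= x) for x = 0..n."""
    cdf, total = [], 0.0
    for x in range(n + 1):
        total += comb(n, x) * p ** x * (1 - p) ** (n - x)
        cdf.append(total)
    return cdf

def ks_statistic(freqs, cdf):
    """Largest absolute gap between observed and theoretical cumulative proportions."""
    n_obs, running, d = sum(freqs), 0, 0.0
    for f, theo in zip(freqs, cdf):
        running += f
        d = max(d, abs(running / n_obs - theo))
    return d

# Study 1 score frequencies for scores 0..8 (Table 1).
# ASSUMPTION: the count for a score of 2 is 11, which makes the total 55.
study1 = [1, 10, 11, 18, 8, 6, 1, 0, 0]

d = ks_statistic(study1, binomial_cdf(8, 0.35))
critical_20 = 1.07 / sqrt(sum(study1))  # large-sample critical value, alpha = .20
print(round(d, 3), round(critical_20, 3), d < critical_20)
```

D falls well below the critical value, so the binomial hypothesis is not rejected even at the .20 level. This agrees with the result reported above.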

For the second study the null hypothesis is that the score
distribution is Bernoulli with p=.55. The Kolmogorov-Smirnov test
indicates that the probability of obtaining the observed data under
the null hypothesis is less than .01. Hence, the assumption that the
second study dealt with a heterogeneous examinee population seems to
be reasonable. In fact, the data from the second study fit a negative
hypergeometric distribution with parameters n=12, a=2.732, and b=13.217
with a probability greater than .20 using the Kolmogorov-Smirnov test,
and provide a good example of an application of the binomial error
model discussed in Lord and Novick (1968, pp. 508-529). Thus, these
tests support the conclusions regarding the relationship of the examinee
group, interitem correlations, and local independence in the Hambleton
and Traub paper.
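Both fits for the second study can be sketched in the same way, under two assumptions that should be read as such:

- The frequencies for scores of 0 and 1 are taken as 2 each. These are the values implied by the item difficulties, which correspond to N=154, and by the reported mean and variance.
- The negative hypergeometric is computed as a beta-binomial whose true-score distribution is Beta(a, b - n + 1), which is how the Lord and Novick parameters are interpreted here.

The critical values are the usual large-sample ones: 1.63 over the square root of N for the .01 level and 1.07 over the square root of N for the .20 level.

```python
from math import comb, exp, lgamma, sqrt

def log_beta(x, y):
    """Natural log of the beta function B(x, y)."""
    return lgamma(x) + lgamma(y) - lgamma(x + y)

def binomial_pmf(n, p):
    return [comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(n + 1)]

def beta_binomial_pmf(n, alpha, beta):
    """Binomial scores whose hit probability varies over examinees as Beta(alpha, beta)."""
    return [comb(n, x) * exp(log_beta(x + alpha, n - x + beta) - log_beta(alpha, beta))
            for x in range(n + 1)]

def ks_statistic(freqs, pmf):
    """Largest absolute gap between observed and theoretical cumulative proportions."""
    n_obs, obs, theo, d = sum(freqs), 0.0, 0.0, 0.0
    for f, prob in zip(freqs, pmf):
        obs += f / n_obs
        theo += prob
        d = max(d, abs(obs - theo))
    return d

# Study 2 score frequencies for scores 0..12 (Table 2).
# ASSUMPTION: the first two counts (scores 0 and 1) are inferred, not read from the table.
study2 = [2, 2, 10, 12, 13, 16, 20, 16, 19, 14, 14, 11, 5]
n_items, a, b = 12, 2.732, 13.217
n_obs = sum(study2)

d_binomial = ks_statistic(study2, binomial_pmf(n_items, 0.55))
# ASSUMPTION: true-score distribution is Beta(a, b - n + 1).
d_neg_hyper = ks_statistic(study2, beta_binomial_pmf(n_items, a, b - n_items + 1))

print("binomial D:", round(d_binomial, 3), "critical (.01):", round(1.63 / sqrt(n_obs), 3))
print("neg. hypergeometric D:", round(d_neg_hyper, 3), "critical (.20):", round(1.07 / sqrt(n_obs), 3))
```

Under these assumptions the binomial D exceeds the .01 critical value, while the negative hypergeometric D stays well under the .20 critical value. Both outcomes match the pattern reported above.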

The purpose of this paper is not to advocate a particular procedure
for evaluating criterion-referenced tests. Rather, it is to remind the
practitioner that more than statistics and measurement theory are
required in order to interpret test results meaningfully, and to provide
two examples which illustrate the importance of considering the entire
testing situation in making inferences about a particular test. While
the areas of the relationship between examinee ability and test
reliability and the implications of item independence have been
addressed in the literature, it appears that the widespread application
of criterion-referenced testing requires that these subjects be
reexplored. Particularly, the statistical properties of restricted
distributions, such as the abilities of students in a mastery learning
program, should be reexamined to determine their applicability in
interpreting criterion-referenced test results. Perhaps by explicitly
defining what is known, the directions for further research will
become more clear.